Cluster Analysis in R

Build a strong intuition for how hierarchical clustering and k-means clustering work and how to interpret their results
machine-learning
cluster-analysis
r
Author

Victor Omondi

Published

October 19, 2020

Overview

Cluster analysis is a powerful toolkit in the data science workbench. It is used to find groups of observations (clusters) that share similar characteristics. These similarities can inform all kinds of business decisions; for example, in marketing, it is used to identify distinct groups of customers for which advertisements can be tailored. We will explore two commonly used clustering methods - hierarchical clustering and k-means clustering. We’ll build a strong intuition for how they work and how to interpret their results. We’ll develop this intuition by exploring three different datasets: soccer player positions, wholesale customer spending data, and longitudinal occupational wage data.

Libraries

library(dplyr)
library(ggplot2)
library(dummies)
library(readr)
library(dendextend)
library(purrr)
library(cluster)
library(tibble)
library(tidyr)

Calculating distance between observations

Cluster analysis seeks to find groups of observations that are similar to one another, but the identified groups are different from each other. This similarity/difference is captured by the metric called distance. We will calculate the distance between observations for both continuous and categorical features. We will also develop an intuition for how the scales of features can affect distance.

What is cluster analysis?

A form of exploratory data analysis (EDA) where observations are divided into meaningful groups that share common characteristics (features).

When to cluster?

  • Identifying distinct groups of stocks that follow similar trading patterns.
  • Using consumer behavior data to identify distinct segments within a market.

Distance between two observations

Distance vs Similarity

\(distance = 1 - similarity\)

dist() function

two_players = data.frame(X=c(0, 9), Y=c(0,12))
two_players %>%
    dist(method="euclidean")
   1
2 15
three_players = data.frame(X=c(0, 9, -2), Y=c(0,12, 19))
three_players %>%
    dist()
         1        2
2 15.00000         
3 19.10497 13.03840
players <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/1219/datasets/94af7037c5834527cc8799a9723ebf3b5af73015/lineup.rds")))
head(players)
    x    y
   -1    1
   -2   -3
    8    6
    7   -8
  -12    8
  -15    0
# Plot the positions of the players
ggplot(two_players, aes(x = X, y = Y)) + 
  geom_point() +
  # Assuming a 40x60 field
  lims(x = c(-30,30), y = c(-20, 20))

# Split the players data frame into two observations
player1 <- two_players[1, ]
player2 <- two_players[2, ]

# Calculate and print their distance using the Euclidean Distance formula
player_distance <- sqrt( (player1$X - player2$X)^2 + (player1$Y - player2$Y)^2 )
player_distance
15

The dist() function makes life easier when working with many dimensions and observations.

dist(three_players)
         1        2
2 15.00000         
3 19.10497 13.03840
three_players
   X   Y
   0   0
   9  12
  -2  19

The importance of scale

When a variable is on a larger scale than the other variables in the data, it may disproportionately influence the resulting distance calculated between observations.

The scale() function by default centers and scales the columns of a numeric matrix or data frame.
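As a quick illustration (hypothetical values, not from the original data), consider two features measured on very different scales. The unscaled distances are driven almost entirely by the larger-scale feature, while scale() puts both on a comparable footing:

# Hypothetical example: weight in grams dominates height in metres
pets <- data.frame(weight_g = c(5000, 5100, 8000),
                   height_m = c(0.3, 1.2, 0.4))

dist(pets)          # distances driven almost entirely by weight_g
dist(scale(pets))   # columns centered to mean 0 and sd 1 before computing distance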

Measuring distance for categorical data

Dummification in R

dummy.data.frame()

job_survey = read.csv("datasets/job_survey.csv")
job_survey
job_satisfaction  is_happy
Low               No
Low               No
Hi                Yes
Low               No
Mid               No
# Dummify the Survey Data
dummy_survey <- dummy.data.frame(job_survey)

# Calculate the Distance
dist_survey <- dist(dummy_survey, method="binary")
# Print the Distance Matrix
dist_survey
          1         2         3         4
2 0.0000000                              
3 1.0000000 1.0000000                    
4 0.0000000 0.0000000 1.0000000          
5 0.6666667 0.6666667 1.0000000 0.6666667

This distance metric successfully captured that observations 1 and 2 are identical (a distance of 0).
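As an aside, the dummies package has been archived on CRAN. If it is unavailable, a similar dummification can be sketched with base R's model.matrix(); the object names below are hypothetical, and under these assumptions the result should match dummy.data.frame() above.

# Ensure the columns are factors (read.csv() in R >= 4.0 reads strings as character)
job_survey_f <- as.data.frame(lapply(job_survey, as.factor))

# One 0/1 indicator column per factor level; contrasts.arg keeps every level
dummy_alt <- model.matrix(
  ~ . - 1,
  data = job_survey_f,
  contrasts.arg = lapply(job_survey_f, contrasts, contrasts = FALSE)
)

dist(dummy_alt, method = "binary")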

Hierarchical clustering

How do you find groups of similar observations (clusters) in data using the calculated distances? We will explore the fundamental principles of hierarchical clustering - the linkage criteria and the dendrogram plot - and how both are used to build clusters. We will also explore data from a wholesale distributor in order to perform market segmentation of clients using their spending habits.

Comparing more than two observations

Hierarchical clustering

  • Complete Linkage: maximum distance between two sets
  • Single Linkage: minimum distance between two sets
  • Average Linkage: average distance between two sets

Capturing K clusters

Hierarchical clustering in R

  • hclust() function to calculate the iterative linkage steps
  • cutree() function to extract the cluster assignments for the desired number (k) of clusters.

The players data frame contains the positions of 12 players at the start of a 6v6 soccer match.

head(players)
    x    y
   -1    1
   -2   -3
    8    6
    7   -8
  -12    8
  -15    0
dist_players = dist(players, method = "euclidean")
hc_players = hclust(dist_players, method = "complete")

Extracting K clusters

(cluster_assignments <- cutree(hc_players, k=2))
 [1] 1 1 2 2 1 1 1 2 2 2 1 2
head(
    players_clustered <- players %>%
        mutate(cluster = cluster_assignments), 
    10)
    x    y  cluster
   -1    1        1
   -2   -3        1
    8    6        2
    7   -8        2
  -12    8        1
  -15    0        1
  -13  -10        1
   15   16        2
   21    2        2
   12  -15        2

The players_clustered data frame contains the x & y positions of 12 players at the start of a 6v6 soccer game, to which we have added cluster assignments based on the following parameters:

  • Distance: Euclidean
  • Number of Clusters (k): 2
  • Linkage Method: Complete

Exploring the clusters

# Count the cluster assignments
count(players_clustered, cluster)
cluster  n
      1  6
      2  6

Visualizing K Clusters

Because clustering analysis is always in part qualitative, it is incredibly important to have the necessary tools to explore the results of the clustering.

players_clustered %>%
    ggplot(aes(x=x, y=y, color=factor(cluster)))+
    geom_point()

Visualizing the dendrogram

Plotting the dendrogram

plot(hc_players)

# Prepare the Distance Matrix
dist_players <- dist(players)

# Generate hclust for complete, single & average linkage methods
hc_complete <- hclust(dist_players, method="complete")
hc_single <- hclust(dist_players, method="single")
hc_average <- hclust(dist_players, method="average")

# Plot & Label the 3 Dendrograms Side-by-Side
# Hint: To see these Side-by-Side run the 4 lines together as one command
par(mfrow = c(1,3))
plot(hc_complete, main = 'Complete Linkage')
plot(hc_single, main = 'Single Linkage')
plot(hc_average, main = 'Average Linkage')

Height of the tree

An advantage of working with a clustering method like hierarchical clustering is that you can describe the relationships between your observations based on both the distance metric and the linkage method selected (the combination of which defines the height of the tree).

Cutting the tree

Coloring the dendrogram - height

dend_players = as.dendrogram(hc_players)
dend_colored = color_branches(dend_players, h=15)
plot(dend_colored)

dend_players = as.dendrogram(hc_players)
dend_colored = color_branches(dend_players, h=10)
plot(dend_colored)

Coloring the dendrogram - K

dend_players = as.dendrogram(hc_players)
dend_colored = color_branches(dend_players, k=2)
plot(dend_colored)

cutree() using height

cluster_assignments <- cutree(hc_players, h=15)
cluster_assignments
 [1] 1 1 2 3 4 4 5 2 6 3 4 6
head(
    players_clustered <- 
    players %>%
        mutate(cluster = cluster_assignments),
    10)
    x    y  cluster
   -1    1        1
   -2   -3        1
    8    6        2
    7   -8        3
  -12    8        4
  -15    0        4
  -13  -10        5
   15   16        2
   21    2        6
   12  -15        3
dist_players <- dist(players, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")

# Create a dendrogram object from the hclust variable
dend_players <- as.dendrogram(hc_players)

# Plot the dendrogram
plot(dend_players)

# Color branches by cluster formed from the cut at a height of 20 & plot
dend_20 <- color_branches(dend_players, h = 20)

# Plot the dendrogram with clusters colored below height 20
plot(dend_20)

# Color branches by cluster formed from the cut at a height of 40 & plot
dend_40 <- color_branches(dend_players, h=40)

# Plot the dendrogram with clusters colored below height 40
plot(dend_40)

The height of any branch is determined by the linkage and distance decisions (in this case complete linkage and Euclidean distance), while the members of the clusters that form below a given height have a maximum linkage distance amongst themselves that is less than that height.

Making sense of the clusters

Wholesale dataset

  • 45 observations
  • 3 features:
    • Milk Spending
    • Grocery Spending
    • Frozen Food Spending
head(
    ws_customers <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/1219/datasets/3558d2b5564714d85120cb77a904a2859bb3d03e/ws_customers.rds"))) , 
    10
)
   Milk  Grocery  Frozen
  11103    12469     902
   2013     6550     909
   1897     5234     417
   1304     3643    3045
   3199     6986    1455
   4560     9965     934
    879     2060     264
   6243     6360     824
  13316    20399    1809
   5302     9785     364

Exploring more than 2 dimensions

  • Plot 2 dimensions at a time
  • Visualize using PCA (see the sketch after this list)
  • Summary statistics by feature
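Below is a minimal sketch (not part of the original analysis) of the PCA option: prcomp() projects the three spending features onto their principal components, and the first two components can then be plotted with ggplot2. The object name pca_customers is hypothetical.

# Project the spending features onto their principal components
pca_customers <- prcomp(ws_customers, scale. = TRUE)

# Plot the customers on the first two principal components
as.data.frame(pca_customers$x[, 1:2]) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point()

# Once the cluster assignments from the next steps exist (clust_customers),
# they can be mapped to color with aes(color = factor(clust_customers)).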

Segment wholesale customers

# Calculate Euclidean distance between customers
dist_customers <- dist(ws_customers, method="euclidean")

# Generate a complete linkage analysis 
hc_customers <- hclust(dist_customers, method="complete")

# Plot the dendrogram
plot(hc_customers)

# Create a cluster assignment vector at h = 15000
clust_customers <- cutree(hc_customers, h=15000)

# Generate the segmented customers data frame
head(
    segment_customers <- mutate(ws_customers, cluster = clust_customers), 
    10
)
   Milk  Grocery  Frozen  cluster
  11103    12469     902        1
   2013     6550     909        2
   1897     5234     417        2
   1304     3643    3045        2
   3199     6986    1455        2
   4560     9965     934        2
    879     2060     264        2
   6243     6360     824        2
  13316    20399    1809        3
   5302     9785     364        2

Explore wholesale customer clusters

Since we are working with more than 2 dimensions, it would be challenging to visualize a scatter plot of the clusters; instead, we will rely on summary statistics to explore them. We will analyze the mean amount spent in each cluster for all three categories.

# Count the number of customers that fall into each cluster
count(segment_customers, cluster)
cluster   n
      1   5
      2  29
      3   5
      4   6
# Color the dendrogram based on the height cutoff
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h=15000)

# Plot the colored dendrogram
plot(dend_colored)

# Calculate the mean for each category
segment_customers %>% 
  group_by(cluster) %>% 
  summarise_all(list(mean))
cluster       Milk    Grocery     Frozen
      1  16950.000  12891.400    991.200
      2   2512.828   5228.931   1795.517
      3  10452.200  22550.600   1354.800
      4   1249.500   3916.833  10888.667
  • Customers in cluster 1 spent more money on Milk than any other cluster.
  • Customers in cluster 3 spent more money on Grocery than any other cluster.
  • Customers in cluster 4 spent more money on Frozen goods than any other cluster.
  • The majority of customers fell into cluster 2 and did not show any excessive spending in any category.

Whether these clusters are meaningful depends heavily on the business context of the clustering.

K-means clustering

Build an understanding of the principles behind the k-means algorithm, explore how to select the right k when it isn’t previously known, and revisit the wholesale data from a different perspective.

Introduction to K-means

head(
    lineup <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/1219/datasets/94af7037c5834527cc8799a9723ebf3b5af73015/lineup.rds"))) ,
    10
)
    x    y
   -1    1
   -2   -3
    8    6
    7   -8
  -12    8
  -15    0
  -13  -10
   15   16
   21    2
   12  -15

kmeans()

model = kmeans(lineup, centers = 2)

Assigning clusters

head(model)
$cluster
 [1] 2 2 1 1 2 2 2 1 1 1 2 1

$centers
          x          y
1  14.83333  0.1666667
2 -11.33333 -0.5000000

$totss
[1] 3489.91666666667

$withinss
[1] 863.666666666667 570.833333333333

$tot.withinss
[1] 1434.5

$betweenss
[1] 2055.41666666667
model$cluster
 [1] 2 2 1 1 2 2 2 1 1 1 2 1
head(
    lineup_clustered <- lineup %>%
        mutate(cluster=model$cluster), 
    10
)
    x    y  cluster
   -1    1        2
   -2   -3        2
    8    6        1
    7   -8        1
  -12    8        2
  -15    0        2
  -13  -10        2
   15   16        1
   21    2        1
   12  -15        1

K-means on a soccer field

# Build a kmeans model
model_km2 <- kmeans(lineup, centers = 2)

# Extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster

# Create a new data frame appending the cluster assignment
lineup_km2 <- mutate(lineup, cluster = clust_km2)

# Plot the positions of the players and color them using their cluster
ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

K-means on a soccer field (part 2)

# Build a kmeans model
model_km3 <- kmeans(lineup, centers=3)

# Extract the cluster assignment vector from the kmeans model
clust_km3 <- model_km3$cluster

# Create a new data frame appending the cluster assignment
lineup_km3 <- mutate(lineup, cluster=clust_km3)

# Plot the positions of the players and color them using their cluster
ggplot(lineup_km3, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

Evaluating different values of K by eye

Generating the elbow plot

tot_withinss <- map_dbl(1:10, function(k){
    model_l <- kmeans(lineup, centers = k)
    model_l$tot.withinss
})
head(
    elbow_df_l <- data.frame(
    k=1:10,
    tot_withinss = tot_withinss
), 10)
  k  tot_withinss
  1     3489.9167
  2     1434.5000
  3      881.2500
  4      622.5000
  5      481.6667
  6      265.1667
  7      350.5000
  8       96.5000
  9       82.0000
 10       73.5000

Generating the elbow plot

elbow_df_l %>%
    ggplot(aes(x=k, y=tot_withinss)) +
    geom_line() +
    scale_x_continuous(breaks = 1:10)

Silhouette analysis: observation level performance

Silhouette analysis

Silhouette analysis allows you to calculate how similar each observation is to the cluster it is assigned to, relative to other clusters. This metric (silhouette width) ranges from -1 to 1 for each observation in your data and can be interpreted as follows:

  • Values close to 1 suggest that the observation is well matched to the assigned cluster
  • Values close to 0 suggest that the observation is borderline matched between two clusters
  • Values close to -1 suggest that the observations may be assigned to the wrong cluster

Calculating S(i)
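For reference, the silhouette width of an observation \(i\) compares \(a(i)\), its average distance to the other members of its own cluster, with \(b(i)\), its average distance to the members of the nearest neighboring cluster:

\(s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\)

So \(s(i)\) approaches 1 when \(b(i)\) is much larger than \(a(i)\), and turns negative when the observation is, on average, closer to the neighboring cluster than to its own.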

pam_k3 <- pam(lineup, k=3)
head(pam_k3$silinfo$widths, 10)
    cluster  neighbor     sil_width
4         1         2   0.465320054
2         1         3   0.321729341
10        1         2   0.311385893
1         1         3   0.271890169
9         2         1   0.443606497
8         2         1   0.398547473
12        2         1   0.393982685
3         2         1  -0.009151755
11        3         1   0.546797052
6         3         1   0.529967901

Silhouette plot

sil_plot <- silhouette(pam_k3)
plot(sil_plot)

# Generate a k-means model using the pam() function with a k = 2
pam_k2 <- pam(lineup, k = 2)

# Plot the silhouette visual for the pam_k2 model
plot(silhouette(pam_k2))

Notice that for k = 2, no observation has a silhouette width close to 0, whereas for k = 3, observation 3 has a silhouette width close to 0 and is negative. This suggests that k = 3 is not the right number of clusters.

Average silhouette width

pam_k3$silinfo$avg.width
0.353414012920685
  • 1: Observations are well matched to their clusters
  • 0: Observations are on the border between clusters
  • -1: Observations are poorly matched to their clusters

Highest average silhouette width

sil_width <- map_dbl(2:10, function(k){
    model_sil <- pam(lineup, k=k)
    model_sil$silinfo$avg.width
})
head(sil_df <- data.frame(
    k = 2:10,
    sil_width = sil_width
), 10)
  k  sil_width
  2  0.4164141
  3  0.3534140
  4  0.3535534
  5  0.3724115
  6  0.3436130
  7  0.3236397
  8  0.3275222
  9  0.2547311
 10  0.2099424

Choosing K using average silhouette width

sil_df %>%
    ggplot(aes(x=k, y=sil_width)) +
    geom_line() +
    scale_x_continuous(breaks = 2:10)

Making sense of the K-means clusters

Segmenting with K-means

  • Estimate the “best” k using average silhouette width
  • Run k-means with the suggested k
  • Characterize the spending habits of these clusters of customers

Revisiting wholesale data: “Best” k

head(ws_customers, 10)
   Milk  Grocery  Frozen
  11103    12469     902
   2013     6550     909
   1897     5234     417
   1304     3643    3045
   3199     6986    1455
   4560     9965     934
    879     2060     264
   6243     6360     824
  13316    20399    1809
   5302     9785     364
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10,  function(k){
  model <- pam(x = ws_customers, k = k)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)

k = 2 has the highest average silhouette width and is the “best” value of k we will move forward with.

Revisiting wholesale data: Exploration

From the previous analysis we have found that k = 2 has the highest average silhouette width. We will continue to analyze the wholesale customer data by building and exploring a kmeans model with 2 clusters.

set.seed(42)

# Build a k-means model for the customers_spend with a k of 2
model_customers <- kmeans(ws_customers, centers=2)

# Extract the vector of cluster assignments from the model
clust_customers <- model_customers$cluster

# Build the segment_customers data frame
segment_customers <- mutate(ws_customers, cluster = clust_customers)

# Calculate the size of each cluster
count(segment_customers, cluster)
cluster   n
      1  35
      2  10
# Calculate the mean for each category
segment_customers %>% 
  group_by(cluster) %>% 
  summarise_all(list(mean))
cluster       Milk  Grocery    Frozen
      1   2296.257     5004  3354.343
      2  13701.100    17721  1173.000

It seems that in this case cluster 1 consists of customers who proportionally spend more on Frozen food, while cluster 2 customers spend more on Milk and Grocery. When we explored this data using hierarchical clustering, the method resulted in 4 clusters, while k-means gave us 2. Both of these results are valid, but deciding which one is appropriate would require more subject matter expertise. Generating clusters is a science, but interpreting them is an art.

Case Study: National Occupational mean wage

Explore how the average salary amongst professions has changed over time.

Occupational wage data

We will use data from the Occupational Employment Statistics (OES) program, which produces employment and wage estimates annually. This data contains the yearly average income from 2001 to 2016 for 22 occupation groups. We would like to use this data to identify clusters of occupations that maintained similar income trends.

oes <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/1219/datasets/1e1ec9f146a25d7c71a6f6f0f46c3de7bcefd36c/oes.rds")))
head(oes)
                            2001  2002  2003  2004  2005  2006  2007   2008   2010   2011   2012   2013   2014   2015   2016
Management                 70800 78870 83400 87090 88450 91930 96150 100310 105440 107410 108570 110550 112490 115020 118020
Business Operations        50580 53350 56000 57120 57930 60000 62410  64720  67690  68740  69550  71020  72410  73800  75070
Computer Science           60350 61630 64150 66370 67100 69240 72190  74500  77230  78730  80180  82010  83970  86170  87880
Architecture/Engineering   56330 58020 60390 63060 63910 66190 68880  71430  75550  77120  79000  80100  81520  82980  84300
Life/Physical/Social Sci.  49710 52380 54930 57550 58030 59660 62020  64280  66390  67470  68360  69400  70070  71220  72930
Community Services         34190 34630 35800 37050 37530 39000 40540  41790  43180  43830  44240  44710  45310  46160  47200
  • 22 Occupation Observations
  • 15 Measurements of Average Income from 2001-2016
glimpse(oes)
 num [1:22, 1:15] 70800 50580 60350 56330 49710 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:22] "Management" "Business Operations" "Computer Science" "Architecture/Engineering" ...
  ..$ : chr [1:15] "2001" "2002" "2003" "2004" ...
summary(oes)
      2001            2002            2003            2004      
 Min.   :16720   Min.   :17180   Min.   :17400   Min.   :17620  
 1st Qu.:26728   1st Qu.:27393   1st Qu.:27858   1st Qu.:28535  
 Median :34575   Median :35205   Median :36180   Median :37335  
 Mean   :37850   Mean   :39701   Mean   :41018   Mean   :42275  
 3rd Qu.:49875   3rd Qu.:53108   3rd Qu.:55733   3rd Qu.:57443  
 Max.   :70800   Max.   :78870   Max.   :83400   Max.   :87090  
      2005            2006            2007            2008       
 Min.   :17840   Min.   :18430   Min.   :19440   Min.   : 20220  
 1st Qu.:29043   1st Qu.:29688   1st Qu.:30810   1st Qu.: 31643  
 Median :37790   Median :39030   Median :40235   Median : 41510  
 Mean   :42775   Mean   :44329   Mean   :46074   Mean   : 47763  
 3rd Qu.:58005   3rd Qu.:59915   3rd Qu.:62313   3rd Qu.: 64610  
 Max.   :88450   Max.   :91930   Max.   :96150   Max.   :100310  
      2010             2011             2012             2013       
 Min.   : 21240   Min.   : 21430   Min.   : 21380   Min.   : 21580  
 1st Qu.: 32863   1st Qu.: 33430   1st Qu.: 33795   1st Qu.: 34120  
 Median : 42995   Median : 43610   Median : 44055   Median : 44565  
 Mean   : 49758   Mean   : 50555   Mean   : 51077   Mean   : 51800  
 3rd Qu.: 67365   3rd Qu.: 68423   3rd Qu.: 69253   3rd Qu.: 70615  
 Max.   :105440   Max.   :107410   Max.   :108570   Max.   :110550  
      2014             2015             2016       
 Min.   : 21980   Min.   : 22850   Min.   : 23850  
 1st Qu.: 34718   1st Qu.: 35425   1st Qu.: 36350  
 Median : 45265   Median : 46075   Median : 46945  
 Mean   : 52643   Mean   : 53785   Mean   : 55117  
 3rd Qu.: 71825   3rd Qu.: 73155   3rd Qu.: 74535  
 Max.   :112490   Max.   :115020   Max.   :118020  
dim(oes)
[1] 22 15

There are no missing values, no categorical variables, and the features are on the same scale.
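A quick sanity check (a small sketch, not in the original analysis) confirms this before clustering:

anyNA(oes)        # expect FALSE: no missing values
is.numeric(oes)   # expect TRUE: all measurements are numeric average incomes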

Next steps: hierarchical clustering

  • Evaluate whether pre-processing is necessary
  • Create a distance matrix
  • Build a dendrogram
  • Extract clusters from dendrogram
  • Explore resulting clusters

Hierarchical clustering: Occupation trees

The oes data is ready for hierarchical clustering without any further preprocessing. We will build a dendrogram of occupations based on their yearly average salaries and propose clusters using a height of 100,000.

# Calculate Euclidean distance between the occupations
dist_oes <- dist(oes, method = "euclidean")

# Generate an average linkage analysis 
hc_oes <- hclust(dist_oes, method = "average")

# Create a dendrogram object from the hclust variable
dend_oes <- as.dendrogram(hc_oes)

# Plot the dendrogram
plot(dend_oes)

# Color branches by cluster formed from the cut at a height of 100000
dend_colored <- color_branches(dend_oes, h = 100000)

# Plot the colored dendrogram
plot(dend_colored)

Based on the dendrogram it may be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another. Let’s continue this exploration.

Hierarchical clustering: Preparing for exploration

We have now created a potential clustering for the oes data. Before we can explore these clusters with ggplot2, we will need to process the oes data matrix into a tidy data frame with each occupation assigned to its cluster.

# Use rownames_to_column to move the rownames into a column of the data frame
df_oes <- rownames_to_column(as.data.frame(oes), var = 'occupation')

# Create a cluster assignment vector at h = 100,000
cut_oes <- cutree(hc_oes, h = 100000)

# Generate the segmented oes data frame
clust_oes <- mutate(df_oes, cluster = cut_oes)

# Create a tidy data frame by gathering the year and values into two columns
head(gathered_oes <- gather(data = clust_oes, 
                       key = year, 
                       value = mean_salary, 
                       -occupation, -cluster), 10)
occupation                  cluster  year  mean_salary
Management                        1  2001        70800
Business Operations               2  2001        50580
Computer Science                  2  2001        60350
Architecture/Engineering          2  2001        56330
Life/Physical/Social Sci.         2  2001        49710
Community Services                3  2001        34190
Legal                             1  2001        69030
Education/Training/Library        3  2001        39130
Arts/Design/Entertainment         3  2001        39770
Healthcare Practitioners          2  2001        49930

Hierarchical clustering: Plotting occupational clusters

We have successfully created all the parts necessary to explore the results of this hierarchical clustering work. We will leverage the named assignment vector cut_oes and the tidy data frame gathered_oes to analyze the resulting clusters.

# View the clustering assignments by sorting the cluster assignment vector
sort(cut_oes)
Management                  1
Legal                       1
Business Operations         2
Computer Science            2
Architecture/Engineering    2
Life/Physical/Social Sci.   2
Healthcare Practitioners    2
Community Services          3
Education/Training/Library  3
Arts/Design/Entertainment   3
Healthcare Support          3
Protective Service          3
Food Preparation            3
Grounds Cleaning & Maint.   3
Personal Care               3
Sales                       3
Office Administrative       3
Farming/Fishing/Forestry    3
Construction                3
Installation/Repair/Maint.  3
Production                  3
Transportation/Moving       3
# Plot the relationship between mean_salary and year and color the lines by the assigned cluster
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) + 
    geom_line(aes(group = occupation))

From this work it looks like both Management & Legal professions (cluster 1) experienced the most rapid growth in these 15 years. Let’s see what we can get by exploring this data using k-means.

Reviewing the HC results

Next steps: k-means clustering

  • Evaluate whether pre-processing is necessary
  • Estimate the “best” k using the elbow plot
  • Estimate the “best” k using the maximum average silhouette width
  • Explore resulting clusters

K-means: Elbow analysis

leverage the k-means elbow plot to propose the “best” number of clusters.

# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
  model <- kmeans(x = oes, centers = k)
  model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

K-means: Average Silhouette Widths

So hierarchical clustering resulted in 3 clusters and the elbow method suggests 2. We will use average silhouette widths to explore what the “best” value of k should be.

# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10,  function(k){
  model <- pam(oes, k = k)
  model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)

It seems that this analysis results in yet another value of k; this time 7 is the top contender (although 2 comes very close).

Comparing the two clustering methods

                            Hierarchical Clustering          k-means
Distance Used               virtually any                    Euclidean only
Results Stable              Yes                              No
Evaluating # of Clusters    dendrogram, silhouette, elbow    silhouette, elbow
Computation Complexity      relatively higher                relatively lower